import SetupEnv from './common/setup-env.mdx';

# Integrate with any interface (preview)

Starting from Midscene v0.28.0, you can integrate Midscene with any interface. Define your own interface controller class that conforms to the `AbstractInterface` class, and you get a fully featured Midscene Agent.

A typical use of this feature is building a GUI automation agent for your own interface, such as an IoT device, an in-house app, or an in-car display.

After implementing the class, you get all of these features:

- the GUI Automation Agent SDK
- the playground for debugging
- YAML scripts to control the interface
- the MCP server
- all the other features of Midscene Agent

Please note that only models with visual grounding capabilities can be used to control the interface. Read this doc to [choose a model](./choose-a-model).

:::tip This is a preview feature

This feature is still in the preview stage, and we welcome your feedback and suggestions on [GitHub](https://github.com/web-infra-dev/midscene/issues).

:::

## Demo and community project

We have prepared a demo project for you to learn how to define your own interface class. It's highly recommended to check it out.

* [Demo Project](https://github.com/web-infra-dev/midscene-example/tree/main/custom-interface) - A simple demo project that shows how to define your own interface class

* [Android (adb) Agent](https://github.com/web-infra-dev/midscene/blob/main/packages/android/src/device.ts) - This is the Android (adb) Agent for Midscene that implements this feature

* [iOS (WebDriverAgent) Agent](https://github.com/web-infra-dev/midscene/blob/main/packages/ios/src/device.ts) - This is the iOS (WebDriverAgent) Agent for Midscene that implements this feature

There are also some community projects that use this feature:

* [midscene-ios](https://github.com/lhuanyu/midscene-ios) - A project driving the macOS "iPhone Mirroring" app with Midscene

<SetupEnv />

## Implement your own interface class

### Key concepts

* The `AbstractInterface` class: a predefined abstract class that can connect to the Midscene Agent
* The **action space**: the set of actions that can be performed on the interface. It affects how the AI model plans and executes actions

### Step 1. Clone and setup from the demo project

We provide a demo project that runs every feature described in this document. It's the fastest way to get started.

```bash
# prepare the environment
git clone https://github.com/web-infra-dev/midscene-example.git
cd midscene-example/custom-interface
npm install
npm run build

# run the demo
npm run demo
```

### Step 2. Implement your interface class

Define a class that extends the `AbstractInterface` class, and implement the required methods.

You can get the sample implementation from the [`./src/sample-device.ts`](https://github.com/web-infra-dev/midscene-example/blob/main/custom-interface/src/sample-device.ts) file. Let's take a glance at it.

```typescript
import type { DeviceAction, Size } from '@midscene/core';
import { getMidsceneLocationSchema, z } from '@midscene/core';
import {
  type AbstractInterface,
  defineAction,
  defineActionTap,
  defineActionInput,
  // ... other action imports
} from '@midscene/core/device';

export interface SampleDeviceConfig {
  deviceName?: string;
  width?: number;
  height?: number;
  dpr?: number;
}

/**
 * SampleDevice - A template implementation of AbstractInterface
 */
export class SampleDevice implements AbstractInterface {
  interfaceType = 'sample-device';
  private config: Required<SampleDeviceConfig>;

  constructor(config: SampleDeviceConfig = {}) {
    this.config = {
      deviceName: config.deviceName || 'Sample Device',
      width: config.width || 1920,
      height: config.height || 1080,
      dpr: config.dpr || 1,
    };
  }

  /**
   * Required: Take a screenshot and return base64 string
   */
  async screenshotBase64(): Promise<string> {
    // TODO: Implement actual screenshot capture
    console.log('📸 Taking screenshot...');
    return 'data:image/png;base64,...'; // Your screenshot implementation
  }

  /**
   * Required: Get interface dimensions
   */
  async size(): Promise<Size> {
    return {
      width: this.config.width,
      height: this.config.height,
      dpr: this.config.dpr,
    };
  }

  /**
   * Required: Define available actions for AI model
   */
  actionSpace(): DeviceAction[] {
    return [
      // Basic tap action
      defineActionTap(async (param) => {
        // TODO: Implement tap at param.locate.center coordinates
        await this.performTap(param.locate.center[0], param.locate.center[1]);
      }),

      // Text input action  
      defineActionInput(async (param) => {
        // TODO: Implement text input
        await this.performInput(param.locate.center[0], param.locate.center[1], param.value);
      }),

      // Custom action example
      defineAction({
        name: 'CustomAction',
        description: 'Your custom device-specific action',
        paramSchema: z.object({
          locate: getMidsceneLocationSchema(),
          // ... custom parameters
        }),
        call: async (param) => {
          // TODO: Implement custom action
        },
      }),
    ];
  }

  async destroy(): Promise<void> {
    // TODO: Cleanup resources
  }

  // Private implementation methods
  private async performTap(x: number, y: number): Promise<void> {
    // TODO: Your actual tap implementation
  }

  private async performInput(x: number, y: number, text: string): Promise<void> {
    // TODO: Your actual input implementation  
  }
}
```

The key methods that you need to implement are:
- `screenshotBase64()`, `size()`: help the AI model to get the context of the interface
- `actionSpace()`: returns an array of `DeviceAction` objects that define what can be done on the interface. The AI model chooses from these actions when it plans and executes a task. Midscene provides a set of predefined actions for the most common interfaces and devices, and there is also a method to define any custom action.
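
When filling in a handler such as `performTap`, it can help to reject coordinates that fall outside the reported `size()` before they reach the device, and to test that logic without real hardware. The sketch below shows one way to do this. `ScreenSize`, `isInsideScreen`, and `RecordingDevice` are hypothetical names used only for illustration; the real `Size` type comes from `@midscene/core`.

```typescript
// Local stand-in for the `Size` type from '@midscene/core' (assumed shape).
interface ScreenSize {
  width: number;
  height: number;
  dpr?: number;
}

// Returns true when the point lies inside the screen rectangle.
function isInsideScreen(x: number, y: number, size: ScreenSize): boolean {
  return x >= 0 && y >= 0 && x < size.width && y < size.height;
}

// Hypothetical in-memory device: it records taps instead of driving hardware,
// which makes the handler logic easy to unit test.
class RecordingDevice {
  readonly taps: Array<[number, number]> = [];
  private readonly screen: ScreenSize;

  constructor(screen: ScreenSize) {
    this.screen = screen;
  }

  async performTap(x: number, y: number): Promise<void> {
    if (!isInsideScreen(x, y, this.screen)) {
      throw new RangeError(
        `tap (${x}, ${y}) is outside ${this.screen.width}x${this.screen.height}`,
      );
    }
    this.taps.push([x, y]);
  }
}
```

A real device class would replace the `this.taps.push(...)` line with the call that actually performs the tap.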

Use these commands to run the agent:

- `npm run build` to rebuild the agent
- `npm run demo` to run the agent with JavaScript
- `npm run demo:yaml` to run the agent with a YAML script


### Step 3. Test the agent with the playground

Attach a playground server to the agent, and you can test the agent in the web browser.

```ts 
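import 'dotenv/config'; // read Midscene environment variables from .env file
import { Agent } from '@midscene/core/agent';
import { playgroundForAgent } from '@midscene/playground';
import { SampleDevice } from './sample-device';

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));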

// instantiate device and agent
const device = new SampleDevice();
await device.launch();
const agent = new Agent(device);

// launch playground
const server = await playgroundForAgent(agent).launch();

// close playground
await sleep(10 * 60 * 1000);
await server.close();
console.log('Playground closed!');
```

### Step 4. Test the MCP server

(still in progress)

### Step 5. Release the npm package, and let your users use it

The agent and interface class are exported from the `./index.ts` file, so you can now publish the package to npm.

Fill in the `name` and `version` fields in the `package.json` file, and then run the following command:

```bash
npm publish
```

A typical usage of your npm package is like this:

```typescript
import 'dotenv/config'; // read Midscene environment variables from .env file
import { Agent } from '@midscene/core/agent';
// replace 'my-pkg-name' with your published package name
// (this assumes the device class is a named export)
import { SampleDevice } from 'my-pkg-name';

// instantiate device and agent
const device = new SampleDevice();
await device.launch();
const agent = new Agent(device);

await agent.aiAction('click the button');
```

### Step 6. Invoke your class in Midscene CLI and YAML script

Write a yaml script with the `interface` section to invoke your class.

```yaml
interface:
  module: 'my-pkg-name'
  # export: 'MyDeviceClass' # use this if this is a named export

config: 
  output: './data.json'
```

This config is equivalent to:
```typescript
import { Agent } from '@midscene/core/agent';
import MyDeviceClass from 'my-pkg-name';
const device = new MyDeviceClass();
const agent = new Agent(device, {
  output: './data.json',
});
```

All other fields work the same as in a regular [YAML script](./automate-with-scripts-in-yaml.html).
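
The difference between a default export and a named `export` can be pictured with a small helper. This is only an illustration of the selection rule described by the `module` and `export` fields, not the Midscene CLI's actual loader; `InterfaceCtor` and `pickInterfaceClass` are hypothetical names.

```typescript
// Constructor type for an interface class (no constructor arguments in this sketch).
type InterfaceCtor = new () => object;

// Pick the class out of a loaded module: the named export when `exportName`
// is given, otherwise the default export.
function pickInterfaceClass(
  mod: Record<string, unknown>,
  exportName?: string,
): InterfaceCtor {
  const key = exportName ?? 'default';
  const candidate = mod[key];
  if (typeof candidate !== 'function') {
    throw new Error(`Export "${key}" is not a class`);
  }
  return candidate as InterfaceCtor;
}
```

With `export: 'MyDeviceClass'` in the YAML, the second argument would be `'MyDeviceClass'`; without it, the default export is used.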

## API reference

### `AbstractInterface` class

```typescript
import { AbstractInterface } from '@midscene/core/device';
```

`AbstractInterface` is the key class for the agent to control the interface. 

These are the required methods that you need to implement:

- `interfaceType: string`: a name for the interface. It is not provided to the AI model
- `screenshotBase64(): Promise<string>`: take a screenshot of the interface and return it as a base64 string with the `data:image/` prefix
- `size(): Promise<Size>`: the size and dpr of the interface, which is an object with the `width`, `height`, and `dpr` properties
- `actionSpace(): DeviceAction[] | Promise<DeviceAction[]>`: the action space of the interface, which is an array of `DeviceAction` objects. Use predefined actions or define any custom action.
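
Many capture tools return a bare base64 payload without the `data:image/` prefix. A minimal sketch of a normalizing helper follows; `ensureDataUri` is a hypothetical name, and the PNG default is an assumption for illustration.

```typescript
// Prefix a raw base64 payload so it matches the 'data:image/...' form.
// `mimeType` defaults to PNG; this default is an assumption for the sketch.
function ensureDataUri(
  base64OrDataUri: string,
  mimeType: string = 'image/png',
): string {
  if (base64OrDataUri.startsWith('data:image/')) {
    return base64OrDataUri; // already in the expected form
  }
  return `data:${mimeType};base64,${base64OrDataUri}`;
}
```

Inside `screenshotBase64()`, the raw capture result could be passed through such a helper before it is returned.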

Type signatures:

```ts
import type { DeviceAction, Size } from '@midscene/core';

abstract class AbstractInterface {
  // Required
  abstract interfaceType: string;
  abstract screenshotBase64(): Promise<string>;
  abstract size(): Promise<Size>;
  abstract actionSpace(): DeviceAction[] | Promise<DeviceAction[]>;

  // Optional lifecycle/hooks
  abstract destroy?(): Promise<void>;
  abstract describe?(): string;
  abstract beforeInvokeAction?(actionName: string, param: any): Promise<void>;
  abstract afterInvokeAction?(actionName: string, param: any): Promise<void>;
}
```

These are the optional methods that you can implement:

- `destroy?(): Promise<void>`: destroy the interface
- `describe?(): string`: describe the interface. This may be used in the report and the playground, but it is not provided to the AI model.
- `beforeInvokeAction?(actionName: string, param: any): Promise<void>`: a hook function before invoking an action in action space
- `afterInvokeAction?(actionName: string, param: any): Promise<void>`: a hook function after invoking an action
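
As an illustration of what the two hooks can be used for, the sketch below records how long each action takes. `ActionTimer` is a hypothetical helper, and it assumes the hooks are called in before/after pairs for each action.

```typescript
// Hypothetical helper implementing the two optional hooks with the
// signatures listed above. It records the duration of each action.
class ActionTimer {
  readonly durations: Array<{ actionName: string; ms: number }> = [];
  private readonly startedAt = new Map<string, number>();

  async beforeInvokeAction(actionName: string, _param: unknown): Promise<void> {
    this.startedAt.set(actionName, Date.now());
  }

  async afterInvokeAction(actionName: string, _param: unknown): Promise<void> {
    const start = this.startedAt.get(actionName);
    if (start === undefined) return; // no matching "before" call
    this.startedAt.delete(actionName);
    this.durations.push({ actionName, ms: Date.now() - start });
  }
}
```

An interface class could hold an `ActionTimer` instance and forward its own `beforeInvokeAction` and `afterInvokeAction` to it.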

### The action space

The action space is the set of actions that can be performed on the interface. The AI model chooses from these actions when it plans and executes a task. All the descriptions and parameter schemas of the actions are provided to the AI model.

To help you define the action space, Midscene provides a set of predefined actions for the most common interfaces and devices. There is also a method to define any custom action.
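
Conceptually, the action space plays two roles: its names and descriptions are shown to the model, and the chosen action is then looked up by name and invoked. The sketch below is a simplified local model of that idea, not Midscene's internal implementation; `SketchAction`, `summarizeActions`, and `invokeAction` are hypothetical names, and parameter schemas are left out for brevity.

```typescript
// Simplified local model of an action; the real `DeviceAction` type
// lives in '@midscene/core'.
interface SketchAction {
  name: string;
  description: string;
  call: (param: Record<string, unknown>) => Promise<void>;
}

// Render what a model would need to see: action names and descriptions.
function summarizeActions(actions: SketchAction[]): string {
  return actions.map((a) => `- ${a.name}: ${a.description}`).join('\n');
}

// Look up the action the model chose by name and run its handler.
async function invokeAction(
  actions: SketchAction[],
  name: string,
  param: Record<string, unknown>,
): Promise<void> {
  const action = actions.find((a) => a.name === name);
  if (!action) {
    throw new Error(`Unknown action: ${name}`);
  }
  await action.call(param);
}

// Example action space with two actions that only record that they ran.
const invoked: string[] = [];
const demoActions: SketchAction[] = [
  { name: 'Tap', description: 'Tap an element', call: async () => { invoked.push('Tap'); } },
  { name: 'Reboot', description: 'Restart the device', call: async () => { invoked.push('Reboot'); } },
];
```

This is why a clear `name` and `description` matter: they are the only way the model learns what a device-specific action such as `Reboot` does.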

This is how you can import the utils to define the action space:

```typescript
import {
  type ActionTapParam,
  defineAction,
  defineActionTap,
} from '@midscene/core/device';
```

#### The predefined actions

These are the predefined actions for the most common interfaces and devices. You can expose them in your custom interface by implementing the `call` function of each action.

You can find the parameters of the actions in the type definition of these functions.

* `defineActionTap()`: define the tap action. This is also the function to invoke for the `aiTap` method.
* `defineActionDoubleClick()`: define the double click action
* `defineActionInput()`: define the input action. This is also the function to invoke for the `aiInput` method.
* `defineActionKeyboardPress()`: define the keyboard press action. This is also the function to invoke for the `aiKeyboardPress` method.
* `defineActionScroll()`: define the scroll action. This is also the function to invoke for the `aiScroll` method.
* `defineActionDragAndDrop()`: define the drag and drop action
* `defineActionLongPress()`: define the long press action
* `defineActionSwipe()`: define the swipe action

#### Define a custom action

You can define your own action by using the `defineAction()` function. You can also use this method to define more actions for the [PuppeteerAgent](./integrate-with-puppeteer), [AgentOverChromeBridge](./bridge-mode-by-chrome-extension#constructor), and [AndroidAgent](./integrate-with-android).

API Signature:

```typescript
import { defineAction } from "@midscene/core/device";

defineAction({
  name: string,
  description: string,
  paramSchema: z.ZodType<T>,
  call: (param: z.infer<z.ZodType<T>>) => Promise<void>,
});
```

* `name`: the name of the action. The AI model uses this name to invoke the action
* `description`: the description of the action. The AI model uses it to understand what the action does. For complex actions, you can include a more detailed example here.
* `paramSchema`: the [Zod](https://www.npmjs.com/package/zod) schema of the action's parameters. The AI model fills in the parameters according to this schema
* `call`: the function that performs the action. Its `param` argument conforms to the `paramSchema`


Example:

```typescript
import { z } from '@midscene/core';
import { defineAction } from '@midscene/core/device';

defineAction({
  name: 'MyAction',
  description: 'My action',
  paramSchema: z.object({
    name: z.string(),
  }),
  call: async (param) => {
    console.log(param.name);
  },
});
```

If an action needs the location of an element as a parameter, use the `getMidsceneLocationSchema()` function to get the corresponding Zod schema.

A more complex example about defining a custom action:

```typescript
import { getMidsceneLocationSchema, z } from '@midscene/core';
import { defineAction } from '@midscene/core/device';

defineAction({
  name: 'LaunchApp',
  description: 'Launch an app on screen',
  paramSchema: z.object({
    name: z.string().describe('The name of the app to launch'),
    locate: getMidsceneLocationSchema().describe('The app icon to be launched'),
  }),
  call: async (param) => {
    console.log(`launching app: ${param.name}, ui located at: ${JSON.stringify(param.locate.center)}`);
  },
});
```

### `playgroundForAgent` function

```typescript
import { playgroundForAgent } from '@midscene/playground';
```

The `playgroundForAgent` function creates a playground launcher for a specific Agent, allowing you to test and debug your custom interface Agent in a web browser.

#### Function signature

```typescript
function playgroundForAgent(agent: Agent): {
  launch(options?: LaunchPlaygroundOptions): Promise<LaunchPlaygroundResult>
}
```

#### Parameters

- `agent: Agent`: The Agent instance to launch the playground for

#### Return value

Returns an object containing a `launch` method.

#### `launch` method options

```typescript
interface LaunchPlaygroundOptions {
  /**
   * Port to start the playground server on
   * @default 5800
   */
  port?: number;

  /**
   * Whether to automatically open the playground in browser
   * @default true
   */
  openBrowser?: boolean;

  /**
   * Custom browser command to open playground
   * @default 'open' on macOS, 'start' on Windows, 'xdg-open' on Linux
   */
  browserCommand?: string;

  /**
   * Whether to show server logs
   * @default true
   */
  verbose?: boolean;

  /**
   * Unique identifier for the playground server instance
   * Same ID shares playground chat history
   * @default undefined (generates random UUID)
   */
  id?: string;
}
```
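
To make the documented defaults concrete, the sketch below shows what omitted fields would resolve to. It only mirrors the `@default` values listed above and is not the library's own code; `SketchLaunchOptions`, `Platform`, `defaultBrowserCommand`, and `applyDocumentedDefaults` are hypothetical names.

```typescript
// Platform identifiers as reported by Node's `process.platform`.
type Platform = 'darwin' | 'win32' | 'linux';

// Local copy of the option fields documented above.
interface SketchLaunchOptions {
  port?: number;
  openBrowser?: boolean;
  browserCommand?: string;
  verbose?: boolean;
  id?: string;
}

// Documented default for `browserCommand` on each platform.
function defaultBrowserCommand(platform: Platform): string {
  switch (platform) {
    case 'darwin':
      return 'open';
    case 'win32':
      return 'start';
    case 'linux':
      return 'xdg-open';
  }
}

// Fill in the documented defaults for any field the caller left out.
function applyDocumentedDefaults(options: SketchLaunchOptions, platform: Platform) {
  return {
    port: options.port ?? 5800,
    openBrowser: options.openBrowser ?? true,
    browserCommand: options.browserCommand ?? defaultBrowserCommand(platform),
    verbose: options.verbose ?? true,
    id: options.id, // left undefined here; the docs say a random UUID is generated
  };
}
```

For example, passing only `{ openBrowser: false }` on Linux would keep port `5800`, verbose logging, and `xdg-open` as the browser command.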

#### `launch` method return value

```typescript
interface LaunchPlaygroundResult {
  /**
   * The playground server instance
   */
  server: PlaygroundServer;

  /**
   * The server port
   */
  port: number;

  /**
   * The server host
   */
  host: string;

  /**
   * Function to close the playground
   */
  close: () => Promise<void>;
}
```

#### Usage example

```typescript
import 'dotenv/config';
import { playgroundForAgent } from '@midscene/playground';
import { SampleDevice } from './sample-device';
import { Agent } from '@midscene/core/agent';

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Create device and agent instances
const device = new SampleDevice();
const agent = new Agent(device);

// Launch playground
const result = await playgroundForAgent(agent).launch({
  port: 5800,
  openBrowser: true,
  verbose: true
});

console.log(`Playground started: http://${result.host}:${result.port}`);

// Close playground when needed
await sleep(10 * 60 * 1000); // Wait 10 minutes
await result.close();
console.log('Playground closed!');
```

## FAQ 

**Can I use a general-purpose LLM like GPT-4o to control the interface?**

No. You must use a model with visual grounding capabilities. Such models can locate target elements on the interface and return their coordinates, which dramatically improves the stability of the automation.

Read this doc to [choose a model](./choose-a-model).

**Can my interface controller be listed in this document?**

Yes, we are happy to gather creative projects and list them in this document.

[Open an issue](https://github.com/web-infra-dev/midscene/issues) when it's ready.
